
    ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R

    We introduce the C++ application and R package ranger. The software is a fast implementation of random forests for high-dimensional data. Ensembles of classification, regression, and survival trees are supported. We describe the implementation, provide examples, validate the package against a reference implementation, and compare runtime and memory usage with other implementations. The new software proves to scale best with the number of features, samples, trees, and features tried for splitting. Finally, we show that ranger is the fastest and most memory-efficient implementation of random forests for analyzing data on the scale of a genome-wide association study.

    Drought effects on biofuel feedstock production by Populus trichocarpa

    As the world population continues to increase, so does the need for sustainable sources of fuel. Biofuels are of particular interest and could be an economically feasible fuel source given the right conditions. Populus trichocarpa is a rapidly growing plantation species that, in addition to having a fully sequenced genome available for study, displays a wide range of phenotypic traits among genotypes. By analyzing these differences in both plantation and more controlled greenhouse settings, we aimed to discover which genotypes performed best under drought conditions and which physiological mechanisms granted them that high performance. In the field, differences in height and stress tolerance among genotypes were observed, and 60 genotypes of differing water-limitation resistance were selected for further measures. No differences between resistance groups were seen in the physiological measures taken, yet the more resistant genotypes had higher stress tolerance indices and grew taller than susceptible genotypes from similar latitudes. The greenhouse study confirmed the water-limitation resistance rankings for 80% of the genotypes and found that resistant genotypes expressed greater midday stomatal control, enabling them to conserve water. Despite this temporary shutdown of photosynthesis, resistant genotypes assimilate carbon at a higher rate than the susceptible genotypes and can maintain their growth advantage. The quick response rate to water-limited conditions correlates with latitude and water availability of the collection site for the clones, suggesting that clones that do not regularly experience water limitation are more sensitive to it and are able to make short-term adaptations to avoid such conditions. Further evaluation will be needed to examine whether these short-term adaptations can maintain growth over extended periods of drought or on marginal lands, so that these genotypes can be viable candidates for a rotational crop used for biofuel production.

    Block Forests: random forests for blocks of clinical and omics covariate data

    Background: In recent years, more and more multi-omics data have become available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method known for its ability to render complex dependency patterns between the outcome and the covariates. Against this background, we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous, and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcome, we compared the prediction performance of the block forest variants with that of alternatives. We also considered the common special case of having clinical covariates and measurements of a single omics data type available. Results: We identify one variant, termed "block forest", that outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best-performing variants have in common that the block choice is randomized in the split point selection procedure. In the case of having clinical covariates and a single omics data type available, the improvements of the variants over random survival forest were larger than in the case of the multi-omics data. The degree of improvement over random survival forest varied strongly across data sets. Moreover, mandatorily considering all clinical covariates improved the performance. This result should, however, be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application. Conclusions: The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed alternatives in the comparison. Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type.

    arfpy: A python package for density estimation and generative modeling with adversarial random forests

    This paper introduces arfpy, a Python implementation of Adversarial Random Forests (ARF) (Watson et al., 2023), which is a lightweight procedure for synthesizing new data that resembles some given data. The software arfpy equips practitioners with straightforward functionalities for both density estimation and generative modeling. The method is particularly useful for tabular data, and its competitive performance is demonstrated in previous literature. As a major advantage over the mostly deep-learning-based alternatives, arfpy combines the method's reduced requirements in tuning efforts and computational resources with a user-friendly Python interface. This supplies audiences across scientific fields with software to generate data effortlessly. Comment: The software is available at https://github.com/bips-hb/arfp

    [Editorial] Special issue: Artificial intelligence in genomics


    Testing Conditional Independence in Supervised Learning Algorithms

    We propose the conditional predictive impact (CPI), a consistent and unbiased estimator of the association between one or several features and a given outcome, conditional on a reduced feature set. Building on the knockoff framework of Candès et al. (2018), we develop a novel testing procedure that works in conjunction with any valid knockoff sampler, supervised learning algorithm, and loss function. The CPI can be efficiently computed for high-dimensional data without any sparsity constraints. We demonstrate convergence criteria for the CPI and develop statistical inference procedures for evaluating its magnitude, significance, and precision. These tests aid in feature and model selection, extending traditional frequentist and Bayesian techniques to general supervised learning tasks. The CPI may also be applied in causal discovery to identify underlying multivariate graph structures. We test our method using various algorithms, including linear regression, neural networks, random forests, and support vector machines. Empirical results show that the CPI compares favorably to alternative variable importance measures and other nonparametric tests of conditional independence on a diverse array of real and simulated datasets. Simulations confirm that our inference procedures successfully control Type I error and achieve nominal coverage probability. Our method has been implemented in an R package, cpi, which can be downloaded from https://github.com/dswatson/cpi